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Abstract 

Boolean networks have long been used as models of molecular networks and play an in- 
creasingly important role in systems biology. This paper describes a software package, 
Polynome, offered as a web service, that helps users construct Boolean network models 
based on experimental data and biological input. The key feature is a discrete analog of 
parameter estimation for continuous models. With only experimental data as input, the 
software can be used as a tool for reverse-engineering of Boolean network models from 
experimental time course data. 
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1. Introduction 

During the last decade finite dynamical systems, that is, discrete dynamical systems 
with a finite phase space, have been used increasingly in systems biology to model a va- 
riety of biochemical networks, such as metabolic, gene regulatory, and signal transduction 
networks. In many cases, the available data quantity and quality is not sufficient to build 
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detailed quantitative models such as systems of ordinary differential equations, which re- 
quire many parameters that are frequently unknown. In addition, discrete models tend to 
be more intuitive and more easily accessible to life scientists. Boolean networks and the 
more general so-called logical models are the main types of finite dynamical systems that 
have been used successfully in modeling biological networks. 

Discrete dynamical models of biological networks were first introduced by Kauffman 
who u se d Boo lean networks to study the dynamics of gene regulatory networks (IKauffmaru . 
1969bl fl [T993h . A gene is assumed to be in one of two states, expressed (1) or not expressed 



(0). The next state of a gene is determined by a Boolean function in terms of the current 
states of the gene and its immediate neighbors in the network. The state of a network in 
n variables is then a binary vector of length n, representing the state of each node of the 
network. Thus, there are 2" possible states. The dynamics of the network is represented 
by a directed graph on the 2" states, where each state has out-degree one, that is, each 
state is mapped to exactly one other state (possibly itself). 

Boolean models of biological systems are abundant, including gene regu l atory networks 
such as the segment pola rity network in the fruit fly (lAlbert and Othmei].[2003h . the cell 



cycle i n mammalian cells (Faure et al. . 2006 ). in budding yeast ( Li et al. . 20041). and fission 



yeast (IDavidich and Bornholdtl . 120071 ). and metab olic networks in E. coli f lSamal and Jain 



2008 



Barrett et al 



20051 ) and in S. cerevisiae (IHerrgard et all . l2006l ). Also, Boolean 



network models of signaling networks have recently been used to gain insight into d i fferen t 
mechanisms such as the molecular neur otransmitter signaling pathAv ay fiGupta et al.l . 120071 ) . 
the T cell receptor signaling pathway (jSaez- Rodriguez et all . l2007tL the signaling network 
for the long-term survival of cyt otoxic T lymphocyt es in humans (ILi et al.l . |2006| ) , and the 



np 

u 



20081 ). 



abscisic acid signaling pathway (IZhang et al.l 

Boolean models require less detailed information about the system to be modeled, so 
they can be used in cases where quantitative information is missing. They are also useful 
if qualitative predictions from the model are desired, such as whether a T cell becomes 
pro- or anti-infiammatory. Finally, Boolean models are very intuitive compared to models 
based on differential equations or other more sophisticated formalisms. It is also easier 
to explore their dynamics, at least for reasonably small models. On the other hand, an 
important disadvantage of Boolean models, and algebraic models in general, is that there 
are very few theoretical tools available for their construction. Typically, Boolean models 
are built by translating information from the literature into logical statements about the 
interactions of the different molecular species involved in the network. In many cases, the 
biological information about a particular network node might not be sufficient, however, 
to construct a logical function governing regulation. 

In the case of a continuous model, the remedy would be to insert a differential equation 
of specified form, e.g., mass action kinetics, with unspecified parameters. If experimental 
time course data are available one can then use one of several parameter estimation methods 
to determine those unspecified model parameters so that the model fits the given data. Data 
fit is determined by model simulation, using numerical integration of the equations in the 
model. The software package described in this paper addresses the need for a discrete 
analog of this process. 



2 



In the case of missing information about a particular node in the network to be mod- 
eled one can insert a general Boolean function, maybe of a specified type, e.g., a nested 
canalyzing function. This is most easily done by viewing the Boolean function as a gen- 
eral polynomial, with undetermined (0/1) coefficients. If experimental time course data 
for the network is available, then one can use one of several existing inference methods to 
estimate a function that will result in a model that fits the data. This function in addition 
satisfies a specified optimality criterion, similar to the optimality criterion for the fitting of 
continuous parameters. This process might be considered the discrete analog of parameter 
estimation. 

In this paper we describe a software package, Polynome, which can be used for this 
purpose. The package integrates several existing algorithms for parameter estimation and 
model simulation. Space limitations do not allow a detailed self-contained description 
of each of the algorithms, most of whom have already appeared elsewhere. But enough 
detail is given so that the potential user understands the capabilities and hmitations of 
the package. We conclude the paper with an example application of the package to the 
well-known lac operon, the network that regulates lactose metabolism E. coli . We use 
data generated from a Boolean model of this network to illustrate software performance. 



2. Architecture 

In this section we introduce the architecture of the software package Polynome which 
integrates algorithms that perform discrete parameter estimation, or system identification, 
and simulation. A web interface of the software is available at 



http: //polymath. vbi .vt . edu/polynome/ 

The algorithms underlying the software represent Boolean networks as time discrete 
djTiamical systems as follows. Let k = {0, 1} be the field with two elements and arithmetic 
modulo 2. A Boolean network in n variables is a function 



/ = (A,...,/„):F 



with fi e k[xi, . . . , Xn]- It is easy to see that any Boolean function can be represented as 
a polynomial with coefficients in k. Furthermore, this polynomial can be chosen so that 
the variables appear only to the first power. Two directed graphs are associated to this 
function. The wiring diagram has as nodes the variables, and there is a directed edge i —>■ j 
if Xi appears in fj. The state space of / has as nodes all 2" binary strings in k^. There is 
a directed arrow a ^ b if /(a) = b. 

There are two stochastic versions of Boolean networks that are relevant here. The first is 
update-stochastic networks. Here, rather than updating the variables synchronously, they 
are updated asynchronously, using a randomly chosen update order. Update-stochastic 
Boolean networks ha ve been shown to capture interesting aspects of biological networks 
(IChaves et al.l . l2005l ) and stochastic sequent i al up date is used in the most general form 
of the logical models introduced in (jThomasl . Il973l ). The second kind, function-stochastic 
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Boolean networks are obtained by assigning a family of Boolean functions to each node, 
together with a probability distribution on the family. At each update, a function from 
this family is chosen at random. The software described here implements update stochas- 
tic networks as a subclass of function-stochastic networks by including for each function 
also the identify function. Whenever the identity function is chosen for an update, the 
corresponding variable is delayed, resulting in a sequential update. 
Figure [1] shows a flow chart of the software architecture. 




Figure 1: Flow chart of the software package Polynome. 
The input consists of two parts: 



1. time course data, either continuous or Boolean (mandatory input); 

2. A subset of the fi (optional input). 

If the input consists of continuous data, then the software Booleanizes the data first. The 
data need to be provided as a matrix with columns corresponding to nodes and rows 
correspond to experimental data points, whereas the functions are input as a list. 
There are several output options: 

1. A wiring diagram only, showing the dependency relations between the variables of 
the network; 

2. a deterministic Boolean network model, which either fits the data exactly or which 
optimizes between model complexity and data fit, and which can be simulated either 
deterministically or stochastically; 

3. a stochastic Boolean network model. 

The simulator has several capabilities. It can: 

1. simulate a deterministic Boolean network and output the wiring diagram and/or the 
state space; 

2. simulate a deterministic Boolean network using random sequential updates of the 
variables and output the state space with transition probabilities on the edges; 

3. simulate a function-stochastic Boolean network and output the state space with tran- 
sition probabilities on the edges. 

The first step is to preprocess the data: Booleanize them if necessary (see Section [3]) 
and remove any states (rows) which occur more than once. As there are many models 
which may fit a given data set, the set of possible models is typically very large, even with 
the minimality restriction. We offer three ways to search the model space: 

• minimal-model sampling, based on the Grobner fan sampling method (Algorithm [2]) 

• minimal-model estimation, based on the method for noisy data (Algorithm [3]) 

• minimal-model selection, based on the minimal-sets algorithm (Algorithm [1]) . 

For small networks (n < 10), the model space can be explored using one of two methods. 
Algorithm [2] is used to sample the subspace of minimal models and returns a set of weighted 
functions per node (for stochastic models) or a set of weighted inputs per node (for static 
models - dynamics not desired by the user). Algorithm [3] is used to estimate the minimal 
Boolean networks in the model space when inconsistent data are provided or a deterministic 
model is desired. What is returned is a polynomial dynamical system (PDS) that provides 
a best approximate data fit and is not overly complicated. For moderate to large networks 
(n > 10), the model space becomes too large to explore. So Algorithm [T] is used to identify 
a subset of essential variables, and only models involving those variables are considered 
subsequently. This algorithm returns either a minimal PDS (for large deterministic models) 
or a minimal wiring diagram (for static models - dynamics not desired by the user). Note 
that Algorithms [T] and [2] return PDSs that fit the data exactly. 
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Once a PDS has been identified, its dynamics can be simulated with one of the follow- 
ing modules: deterministic or stochastic simulation. Given a PDS, its wiring diagram is 
constructed using the GraphViz dot program. 



3. Data discretization 

Discretization of continuous experimental data into finitely many discrete states is 
important for inferring gene regulatory networks from expe rimental data. Discret ization 



of experimental data has been discussed ext ensively, e.g . , (Dimitrova et al.l . |2008| ). The 



following definition of discretization is due to iHarteminkl (120011 ). 



Definition 1. A discretization of a real- valued vector v = [vi, . . . , vn) is an integer-valued 
vector d = (di, . . . , d^) with the following properties: 

1. Each element of d is in the set 0, 1, . . . , — 1 for some (usually small) positive integer 
D, called the degree of the discretization. 

2. For all 1 < i,j < N, we have di < dj if and only if vt < vj. 

Without loss of generality, assume that v is sorted, i.e., for all i < j, Vi < Vj. Spanning 
discretizations of degree D are a special case that we consider here. They are defined in 



( iHarteminkl . 1200 ll ) as discretizations that satisfy the additional property that the small- 
est element of d is equal to and that the largest element of d is equal to D — 1. The 
translation from continuous to discrete data is crucial in preserving the variable depen- 
dencies and thus has a significant impact on the performance of the network inference 
algorithms. While there is a large selection of discretization methods available which clus- 
ter data points, many of them are not directly applicable in the network inference context 
or are not suitable. One important limitation is typically that the number of available 
data points is very small, typically consisting of less than 10 time points. We apply a 
newly developed method, ba sed on graph theory, especially designed for short time series 



data (iDimitrova et al.l . 120081 ). Novel aspects are incorporation of an information-theoretic 
criterion and a criterion to determine the optimal number of values. While the method 
can be used on other types of data, the motivation for its development was the need for a 
discretization algorithm for several short multivariate time courses of heterogeneous data, 
such as transcript, protein, and metabolite concentration measurements. Furthermore, the 
method has been demonstrated to preserve the dynamic features of the time courses, as 
well as to be robust to noise in the experimental data. 

The method begins by constructing a complete graph in which the vertices are the time 
points and the edge weights are the Euclidean distances between two vertices. Edges are 
deleted consecutively starting with the one of highest weight until the graph is discon- 
nected. The process continues until one of the several stop criteria are met. The goal is to 
minimize the average internal distance of the components and maximize the distance be- 
tween components. In the current work we have limited the number of states to 2, i.e., the 
data are Booleanized. The next version of Polynome will be capable of handling param- 
eter estimation for larger numbers of states. (The only parameter estimation algorithm 



6 




Figure 2: The complete weighted graph constructed from vector entries 1, 2, 7, 9, 10, 11. Only the edge 
weights of the outer edges are given. 



that is currently not capable of handling something other than binary states is REACT. A 
multi-state version is in preparation. However, in order to obtain useful performance, the 
computations will have to be performed in parallel on a multi-processor machine.) While 
Booleanization is a rather drastic transformation of the data and in many cases loses valu- 
able informatio n in the data, it can still derive useful information from experimental data, 
as is shown in IVera-Licona et al.l (120091 ). There, the authors use transcript data from a 
gene regulatory netw ork in yeast used to compare different reverse-engineering methods 
Cantone et al.l (120091 ). It is shown that the performance of REACT with a Booleanization 



of the data compares very favorably to the other methods tested. 

Example 1. Suppose that vectors = (1,2,7,9,10,11) is to be discretized. We start by 
constructing the complete weighted graph based on v. Eight edges with weights 10, 9, 9, 8, 
8, 7, 6, and 5, respectively, have to be deleted to disconnect the graph into two components: 
one containing vertices 1 and 2 and another having vertices 1, 9, 10, and 11; this is the 
first iteration. Having disconnected the graph, the next task is to determine if the obtained 
degree of discretization is sufficient; if not, the components need to be further disconnected 
in a similar manner to obtain a finer discretization. 

A commonly occurring phenomenon, when discretizing time courses, is that the result- 
ing data are inconsistent with a deterministic process. This happens because a given state 
can transition to two different states at different times. So, when a deterministic model is 
desired, these inconsistencies have to be removed. A common cause of such inconsistencies 
is small variations among consecutive time points, so that these get discretized into the 
same state. Eventually, there is sufficient change in the data so that a later discrete state 
becomes different again. This situation is dealt with by removing all but one instance of 
the repeated state. This essentially amounts to a local adjustment of time scale. Since time 
is not represented explicitly in discrete models, this is permissible. In the case of a given 
state transitioning to two different states in two different time courses, we remove the state 
in question, disconnecting the two time courses into four shorter ones. We further assume 
that there are no missing (unmeasured) time points. If new data points are included, then 
the parameter estimation process has to be restarted at the beginning. 
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4. Parameter estimation 



4.1. The minimal- sets algorithm 

Inferring the wiring diagram of a gene regulatory network has received lots of attention 
and many different methods in different contexts have been developed to address this 
problem. Using methods from computational algebra and algebraic geometry, we have 
developed an algorithm that first finds all possible minimal wiring diagrams of a gene 
regulatory network, and then chooses a particular model using different selection methods. 
Here a diagram is minimal if all specified interactions in the wiring diagram are necessary 
to have a function that interpolates the data. 

For a fixed gene, say Xj, we need to identify the minimal sets of genes which could be 
used as inputs to Xj. Let {(si, ti), . . . , (s^, t^)} be the stimuli-response data for the gene 
Xj, where Sj G k"',ti G k. We need to find all minimal subsets F C {1, . . . ,n} such that 
there exists a polynomial function / G k[{xi \ i G F}] with /(sj) = and there is no such 
polynomial on any proper subset of F. The main idea of the algorithm is the following: 
For any two stimuli and s;, such that ta 7^ th, identify all coordinates i such that Sai 7^ s^j. 
If none of these coordinates was picked previously, pick one of them. Once all possible pairs 
are considered, the set of variables corresponding to the chosen coordinates can be used to 
generate the req uired function. This can be done using computational algebra algorithms, 
as described in (ILaubenbacher and Stiglerl . |2004| ). 

Using tools from computational algebra, the steps above can be encoded as an algorithm 
that performs operations on a monomial ideal M with one generator for each pair of stimuli 



encoding their mismatch, and each minimal prime of M is a minimal set. See (jJarrah et al. 



20071 : IStider et all . l2007f ) 



Algorithm 1: Minimal-Sets Algorithm 
Input : {{si,ti), ... ,{Sm,tm)}, with Si e k'',ti e k 

Output: All minimal subsets F G {1, . . . ,n} such that there exists a polynomial 
function / G k[{xi \ i G F}] with /(sj) = tj. 

begin 

Step 1. Compute the ideal M. 

Step 2. Compute a primary decomposition of M. 

Step 3. Output the generating sets of all minimal primes of M. 

end 



It is clear that there will usually be many possible wiring diagrams and selecting a 
single model could be done only based on further assumptions, such as knowledge from 
the literature about some of the interactions, or the network being sparse. For the purpose 
of building a web-b ased applicat ion, we employ the (S'i,Ti) ranking scheme described in 
( Jarrah et al. . 2007 ) and used in f Stigler et al. . 20071 ) to select highest-scoring minimal sets. 
The scheme ranks sets according to size (smaller is better) and frequency of occurrence of 
the variables (higher is better). Given a list of highest-scoring sets of equal rank, we choose 
the first one in the list. 
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4-2. Parameter estimation using the Grobner fan of an ideal 

Typically there are many models that fit a time course of experimental data and often 
there is insufficient information to select one of them. Even restricting the model space to 
minimal models leaves multiple possible models. The reason is that different orderings of 
the polynomial terms (monomials) can give rise to different polynomial models, since the 
algorithm uses such an order for multivariate polynomial division. Algorithm [2] generates all 
Boolean network models fitting the data and, if needed, extracts the corresponding wiring 
diagram. The metho d is based on a comb i nator i al structu r e kno wn as the Grobner fan 
of a polynomial ideal (IMora and Robbiand . Il988l : ISturmfeld . Il996l ). which is a polyhedral 
complex of cones in which every point encodes a monomial ordering. The cones are in 
bijective correspondence with the distinct Grobner bases of an ideal. (To be precise, the 
correspondence is to the marked reduced Grobner bases of the ideal). Therefore, it is 
sufficient to select exactly one monomial ordering per cone and, ignoring the rest of the 
orderings, this still guarantees that all distinct minimal models are generated. In addition, 
the relative number of monomial orderings under which a particular PDS model is generated 
gives us insight into the likelihood that the model is a good representation of the system. 
Th e sizes of the Grobner c ones can be computed for small Grobner fans using the method 
of (lEickmeyer et al.l . l2008l ) or sampling uniformly a large number of points. The monomial 
orderings used in generating the models are selected through random sampling of the 
corresponding Grobner fan of the ideal of points. If the number of points is sufficiently 
large, their distribution approximately reflects the relative size of the Grobner cones. The 
number of points is determined using a t-test hypothesis testing for proportion. 

Steps 1-5 of Algorithm [2] are used for parameter estimation of a stochastic model of a 
system. If the wiring diagram is also required. Step 6 is performed as well. If the Grobner 
fan is too large to compute. Step 3 is replaced by a large random sample of points from 
the Grobner fan which reflects the relative sizes of the Grobner cones. 



4-3. Parameter Estimation in the Presence of Data Noise 

Since experimental data are typically noisy, due to measurement error or intrinsic bi- 
ological noise, robustnes s to noise of infere n ce me thods is desirable. Therefore, in order 
to avoid over-fitting, in (jVera-Licona et al.l . |2009| ) a method for parameter estimation is 
described, based on the premise that the input data may contain noise. In this genetic 
algorithm based method, a Boolean network model is inferred which is optimized with 
respect to both data fit and model complexity; an optimal Boolean model can also be 
constructed when prior knowledge about the network structure is included. 



Consider a network of n nodes. The main elements of the genetic algorithm (GA) are 

• Chromosomes of GA: Polynomial models, where a polynomial model is given as a 
system of n polynomial functions Fi, F2, . . . , 

• Genes of GA: Polynomial functions (hence there are n different classes of genes). 

• Each chromosome (polynomial model) in the GA contains n genes (n polynomial 
functions Fi, F2, . . . , F„). 
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Algorithm 2: Parameter sampling using the relative sizes of the Grobner cones. 
Input : A discrete time course of a network on n nodes xi, . . . , Xn'- 

S = {(Sii, . . . , Snl), • • • , {Sim, • • • ) Snm)} ^ 

Output: A list of length n^: {pij \ i, j = 1, . . . ,n} C [0, 1] where Pij is the strength 
with which Xi affects Xj. 

begin 

Step 1. Compute a particular Boolean network Fq : k"" ^ A;" that fits 5*. 

Step 2. Compute the ideal / of polynomials that vanish on S. 

Step 3. Compute the Grobner fan Q of the ideal / and the relative sizes of its 

cones, Ci, . . . ,Cs (with Ci + • • ■ + Cs=l). 
Step 4. Select one (any) monomial ordering from each cone, -<i, . . . , -Kg. For 

each i = 1, . . . , s, reduce Fq modulo / using a Grobner basis computed 

with respect to Let the reduced PDSs be Fi, . . . , and form the 

2-tuples {Fi,Ci). 

Step 5. If Fi = Fj for some i, j = 1, . . . , s, then "merge" the two 2-tuples into 
(Fj, Q + Cj). After re-indexing, obtain the set {(Fi, ci), . . . , {Ft, q)} 
where t < s and Fj ^ Fj for all i ^ j- 

Step 6. For every z, j = 1, . . . , n, find all Boolean networks F/^, . . . , F;^ whose 
dependency graph contains a directed edge Xi i— > Xj and let pij = 
Qi + ■ ■ ■ + Qd- If no Boolean network contains the edge Xj ^ Xj for 
some i and j, set p^j = 0. 

end 



Algorithm [3] summarizes the key features in this procedure. 

Some comments about the implementation of this algorithm in Polynome are in order. 

Model fitness. The fitness function used in the GA is a multi-objective function 
incorporating the different fitness criteria of each Boolean coordinate function for each 
node in the network as well as the fitness of the fully assembled Boolean models. The 
different criteria include: data fit, model complexity, and consistency with prior knowledge 
about the network structure. 

Parameters. The algorithm is controlled by a set of parameters that specify properties 
such as gene pool size and an upper bound for the maximum number of generations to run 
the GA. In the current version of the software, a default selection of parameters is made 
based on a preselected maximum network size. In future versions of the algorithm, the 
user will be able to control such parameters for an ad-hoc selection based on the specific 
network analyzed by the user. 

Termination criteria. Common concerns about genetic algorithms are the selection of 
termination criteria, how to avoid local minima, or to the computational resources required 
to run the algorithm for sufficiently many generations. Our stopping criterion is based on 
two parameters: a parameter that controls the maximum number of generations to be run 
and the maximum number of generations after which the algorithm is terminated if the 
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Algorithm 3: Parameter estimation in the presence of data noise 
Input : Boolean time courses (wildtype and/or knockout biological data). 

(Optional input: seeding polynomial models (seeding chromosomes) 

and/or prior knowledge of network structure) 
Output: Polynomial models with the best fitness scores. 

begin 

Step 1. If no seed is provided, generate random polynomial models until the 
pool size is reached. Otherwise, use input models and complete pool 
size with randomly generated models; 

Step 2. Evaluate and sort models according to their fitness; 

Step 3. Subdivide population into n sub populations, each corresponding to 
one of the n coordinate functions; 

Step 4. Select polynomials from each subpopulation corresponding to each 
of the coordinate functions and assemble them into new polynomial 
models to form the generation's offspring; 

Step 5. Mutate the assembled candidate models with inverse probability to 
the fitness of the generation, i.e., fewer mutations will take place in a 
generation that shows higher fitness scores; 

Step 6. Clone polynomial models selected to be preserved in the next gener- 
ation; 

Step 7. Build new population from candidate models, cloned models and ran- 
domly generated models; 

Step 8. Repeat Steps 2 through 7 until either the specified number of gen- 
erations is reached or the fitness score has not improved for a pre- 
determined number of generations. 

end 



fitness score has not improved. Both of these parameters have been selected based on the 
current limit on network size and the need of the web service version to provide output 
within a short period of time. It is important to mention that this limitation generally 
leads to models of lower quality than one would obtain from a free-standing version of the 
software. 

Output. The algorithm described in IVera-Licona et al.l (120091 ) provides as output all 
models with highest fitness score. Due to interface limitations, the implementation in 
Polynome outputs only ten of these models. Future versions of Polynome will allow the 
user to request more models. As with all evolutionary algorithms, it is not possible to prove 
convergence results, guarantee that the results are not just local optima, or guarantee that 
the search results are robust under repetition. These are all drawbacks of this method over 
the other methods available in Polynome. It is also possible that two models with the same 
score are very different from each other. On the other hand, this algorithm is the only one 
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that can produce models which do not have to fit the given data exactly, so they are more 
robust to noise. 



5. Simulation 

For the identified network, fixed points and limit cycles are calculated and the phase 
space and wiring diagram are generated. The analog of the graph of a function in the 
continuous case, e.g., solution of a system of differential equations, is the phase space 
for a discrete model. It visualizes the dynamics of the network, namely the fixed points 
and oscillatory cycles. In a deterministic network, each state has out-degree exactly 1, 
in a function stochastic network as generated by 14. 2^ the out-degree can be higher. In a 
stochastic network fixed points have a stability that is calculated from the probabilities of 
the update functions. The stability indicates how likely it is to remain in this state, so 
stability 1 corresponds to a true fixed point. 

For a function stochastic system a synchronous update is used, but for a deterministic 
network it is possible to use a sequential update order instead, in case more biological 
information about the network is available, such as the order in which certain molecular 
processes take place. If the user wants to use sequential update without providing an 
update order, then the software uses stochastic sequential update, that is, at each update 
an update order is chosen at random. As pointed out in the in troduction, sequenti al 
update has been shown to be biologically more realistic (see, e.g., Chaves et al. ( 20051 )). 



which is the reason why we are providing this simulation capability. However, it is easy to 
see that different update orders result in different dynamics, so that it is possible that a 
deterministic system that was chosen to fit a given set of experimental data will not do so 
any longer when simulated sequentially. 

6. An example: the Lac operon 

We demonstrate the key features of the software with an example. For simplicity, we 
choose an existing Boolean network model in order t o be able to compare the ou tput 



to the "true" network. We consider a Boolean model (jStigler and Veliz-Cubal . l2008l ) for 



lactose metabolism in the context of the lac operon for the foll owing two examples. Let 



/ be the 9-node Boolean model in fjStigler and Veliz-Cubal . |2008[ ) in terms of the variables 



{M, P, B,C, R, A, Al, L, Li) and the parameters {Lf.,Ge) (see the original manuscript for 
an introduction to the lac systems and a description of the model). For simplicity, we 
rename the variables clS (xi , . . . , Xg ) and the parameters as (xio, xn), and write the Boolean 
functions as in Table (H where fi represents the Boolean function associated to variable Xi 
and ~ is the logical NOT operator, * AND, and + OR. 

Note that functions in Boolean form, with binary operations * and + and unary oper- 
ation ~ as defined above, can be translated to polynomial form, with field operations -|- 
and X (also written as * in nonformatted text) via the mapping 
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J IvI 


= {r^ R)*C 


J i 


= Xt^l ^ Xa 


fp 


= M 


/2 


= Xi 


fs 


= M 


/3 


= Xi 


fc 


= ~ Ge 


h 


= ~ Xii 




= (~ A) * (~ Ai) 


h 


= (~ Xq) * (~ Xj) 


Ia 


= L* B 


h 


= Xs*X3 




= A + L + Li 


fr 


= Xq + Xg + Xg 


h 


= (~ Ge) *{P* Le) 


fs 


= (~ 2:11) * {X2 * Xio) 


hi 


= (~ Ge) * (L + Le) 


h 


= (~ 2:11) * (Xs+Xio) 



Table 1: Left panel: Boolean functions in the original variables. Right panel: Boolean functions in the 
indeterminates xi, . . . , xn. 



Boolean form 


Polynomial form 


~ X 

X * y 
X + y 


x + 1 
xy 

xy + X + y 



In the following example, we consider the case where we wish to identify some of the 
functions in a partially known network. 

Example 2. In the Boolean model in Table [21 the functions for M, P, B, C, and R are 
straight-forward from a biological perspective. However, this is not the case for the functions 
for lactose (L,Li) and allolactose (A,Ai). There is only one combination of values for 
extracellular glucose (Ge) and lactose (Le) for which the operon is ON; i.e.,Le = 1, Ge = 0. 
Setting the parameters to these values produces a single fixed point (1, 1, 1, 1, 0, 1, 1, 1, 1) in 
the above Boolean model. Further it limits what the functions for lactose (L, Li) could be, 
namely fL = P o,nd f^^ = 1. This leaves and f^^ open for investigation. 

For demonstrative purposes we chose the following data sets, which represent immediate 
initiation (G = 1) of the lac operon when applicable, to use in the parameter estimation 
step. 
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10 1 
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1 


1 





1 
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1 


1 
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1 


110 
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1 








1 





1 














1 1 1 


1 








1 





1 














1 1 1 


1 








1 


1 


1 














1 1 1 


1 





1 


1 


1 


1 














1 1 1 


1 





1 


1 


1 


1 














Transcription of lac genes 


Repression of operon 


1 


1 























110 











1 1 1 


1 


1 











1 








110 
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1 1 
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1 





1 


1 


1 








110 


1 





1 





1 





1 


1 


1 


1 








10 


1 





1 


1 


1 








1 





1 


1 





10 


1 





1 


Initiation by A, Ai 


Initiation by L, Li 





1 





1 


1 














10 





1 


1 


1 


1 








1 





1 


1 

1 
1 




1 



1 


110 
110 
10 
110 


1 
1 
1 
1 





1 




1 
1 
1 
1 



Given the above data, AlgorithmlMis used to identify the connections (see Figure\3^, as 
well as the functions in polynomial form for allolactose as follows: 




Figure 3: Inferred wiring diagram for allolactose. 



f6 = { 

x3*x8 0.422222 
x2*x8 0.577778 
} 

f7 = { 



14 



X4*x8+x5*x7+x8+x4*x5+x7 


, 


.0111111 


X7*x8+x8+x4*x7+x7+x4 


0. 


.0333333 


x6*x9+x6+x9 


0. 


A A A A A A 

.111111 


x7*x9+x7+x9 


0. 


t~\ 1^ r\ 1^ r\ f\ 

. 322222 


x4*x6+x6+x9 


0. 


.122222 


x6*x8+x6+x9 


0. 


. 177778 


X4*x8+x8+x4*x6+x6+x4 


0. 


.0222222 


X4*x8+x8+x4*x7+x7+x4 


0. 


, 0444444 


x4*x7+x7+x9 


0. 


.0666667 


X7*x8+x5*x7+x8+x4*x5+x7 


0, 


.0111111 


X6*x8+x4*x8+x8+x6+x4 


0. 


. 0444444 


X7*x8+x6*x8+x8+x6+x4 


0. 


.0222222 


X5*x7+x5*x9+x7+x9+x4 


0. 


.0111111 



} 



The figure represents only part of the full wiring diagram; we include only the subgraph 
of edges incident to nodes 6 and 7 for simplicity and include the isolated node xl to em- 
phasize that Xl is not an input of either xq or xj. The edge weights are values between 
and 1 and can be interpreted as relative likelihood of interaction or interaction strength. 
That is, if the weight on the edge from node i to node j is p, this means that the relative 
likelihood or strength with which i affects j is p. 

In the table of functions, again we only display the portion of the output that is relevant 
in this example. We find 2 possibilities for f^, one which matches the original function, 
and 13 possibilities for f-j. Note that the original for f'f can be written as 

2^8/7,3 + /7,3 + 2^8 

in polynomial form, or 

+ /7,3 

in Boolean form, where /y^s is the third element in the function list f7. The weights 
associated to each function can be interpreted similarly to the weights on the edges in the 
wiring diagram. Since we have multiple possibilities for each function, we can produce a 
stochastic simulation of the model. This yields the following dynamics: 

Number of components 1 

Number of fixed points 2 

Fixed point, component size, stability 

(11110110 1), 512, 0.08 

(111101111), 512, 1.00 

The first fixed point has very small stability and is therefore not reliable as a steady state. 
However, the second one, which corresponds to the unique fixed point in the original Boolean 
model, has stability 1 indicating that it is a true steady state. 
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Figure 4: Portion of the state space of the Boolean model simulated stochastically. 

In Figure [7] we show the portion of the state space. Since two of the functions are 
constant, we remove them for simplicity. We note that there is a greater probability (0.92) 
of encountering the state in which all molecules are present, as opposed to the low probability 
(0.08) of encountering the state in which all molecules are present except for Ai, which is 
biologically infeasible, from the state (1,1,1,1,0,0,0,1,1^. 

Example 3. Here we consider the case that we do not know any of the functions and aim 
to construct a deterministic dynamic model. Given that the data are consistent, the default 
strategy is to produce a model using the minimal-sets algorithm. However, for demonstrative 
purposes, we illustrate Algorithmic which is reserved for inconsistent or noisy data. Using 
the above data, we get the following model and its dynamics: 



fl = xl*x6*x7 + xl*x6 + x5 + 1 
f2 = xl + 

f3 = xl + x2 + x3*x4 + x4*x9 + x4 + 

f4 = 1 

f5 = x7 + 1 

f6 = x2*x4*x8 + x2*x6 + x4 + x5*x7*x8 + x5*x8 + x6*x9 + x6 + x9 + 
f7 = xl*x3*x6 + x6*x8 + x9 + 
f8 = x3 + 
f9 = 1 



There are 7 components and 3 fixed point (s) 



[001111011] lies in a component of size 6. 
[110100101] lies in a component of size 4. 
[110101101] lies in a component of size 124. 

R is important to mention that Algorithmic returns a list of the highest-scoring models 
(according to an internal fitness score). Hence in some instances like in this example, more 
than one model is returned with the same high score: 



^Italicized coordinates where removed from the figure. 
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fl = x5 + 1 
f2 = xl + 

f3 = xl + x2 + x3*x4 + x4*x9 + x4 + 
f4 = 1 

f5 = x7*x9 + 1 
f6 = x3*x8 + 
f7 = x4 + 

f8 = x2*x7 + x2 + x3*x7 + x5*x6*x8 + 
f9 = 1 



[110100101] lies in a component of size 256. 
[111101111] lies in a component of size 256. 



Both models have the highest ranking, meaning that they both fit the given data well, though 
not exactly. However, the second model has dynamics which resembles the original Boolean 
model: it only has fixed points (the first model has 4 nontrivial cycles) and one matches 
the unique fixed point of the Boolean model. 



7. Discussion 



We have presented a description of a software package to construct models of biological 
networks by fitting Boolean network models to time course experimental data. The software 
is offered as a web application. A detailed tutorial is available to help the user. Several other 
features will be incorporated in the next release, including the ability of the user to specify 
that the software return nested canalyzing Boolean functions, a pa rticular type of Boolea n 
function that was introduced by S. Kauffman and his collaborators (IKauffman et al.l . l2003l ). 
The next release will also allow the user to infer polynomial dynamical systems with more 
than two states. In addition, more useful graphical features will be introduced. 
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